ggml-hexagon: flash-attention and reduce-sum optimizations by chraac · Pull Request #19141 · ggml-org/llama.cpp

chraac · 2026-01-27T16:12:11Z

Further to the discussion in PR #19025, this implements the dual row dot product for flash attention.

Key changes

HVX Vector Math Optimizations

Added hvx_vec_reduce_sum_qf32x2, a helper function for efficiently reducing and accumulating two HVX vectors of qf32 values, and refactored several places in the codebase to use this function for dual-accumulation scenarios. [1] [2] [3] [4] [5]
Introduced new "rx2" (dual accumulation) versions of dot product functions for both f32-f16 and f16-f16 cases (hvx_dot_f32_f16_aa_rx2, hvx_dot_f16_f16_aa_rx2), improving performance by processing two accumulations in parallel. [1] [2]
Refactored the main attention kernel (flash_attn_ext_f16_thread) to use the new "rx2" dot product functions when possible, improving block processing efficiency.

Performance

Device: 8Gen2
Baseline: 0c21677e4
Optimization: 2610805c4
Model: llama3-1b-q4

stage	Baseline (tok/s)	Optimization (tok/s)	Speedup
prompt eval	61.78	72.44	1.18x
eval	28.74	29.40	1.02x

…ccumulation

…ew vectorized implementations

…x2 functions for improved performance

…function for improved readability

…proved performance

chraac · 2026-01-27T16:17:37Z

ggml/src/ggml-hexagon/htp/hvx-reduce.h

+    sum01 = Q6_Vqf32_vadd_Vqf32Vqf32(sum01, Q6_V_vror_VR(sum01, VLEN / 4));
+    sum01 = Q6_Vqf32_vadd_Vqf32Vqf32(sum01, Q6_V_vror_VR(sum01, VLEN / 8));
+    sum01 = Q6_Vqf32_vadd_Vqf32Vqf32(sum01, Q6_V_vror_VR(sum01, VLEN / 16));
+    return sum01;


Optimize reduction sum by processing two vectors simultaneously.

…ccumulation

…ew vectorized implementations

…x2 functions for improved performance

…function for improved readability

…proved performance

max-krasnyansky · 2026-01-30T06:11:52Z

Overall, good idea and you gave me more ideas to implement/cleanup :).
The reduce_sum_qf32x2 is not correct for V75 and newer (qf32 can't be shifted/rotated, etc).
The good news is that we can just replace all instances with f32 version and optimize that version for V75 and up.
I have a branch with the changes. Will share tomorrow (falling asleep right now) so that you can pull it in.

chraac · 2026-01-30T10:09:37Z

The reduce_sum_qf32x2 is not correct for V75 and newer (qf32 can't be shifted/rotated, etc).
The good news is that we can just replace all instances with f32 version and optimize that version for V75 and up.

Interesting. That explains why we need the extra Q6_Vsf_equals_Vqf32 calls in hvx_vec_reduce_sum_n_qf32. Also curious if there are hidden bits involved in the HVX_Vector fp ops.

max-krasnyansky · 2026-01-31T00:59:28Z

The reduce_sum_qf32x2 is not correct for V75 and newer (qf32 can't be shifted/rotated, etc).
The good news is that we can just replace all instances with f32 version and optimize that version for V75 and up.

Interesting. That explains why we need the extra Q6_Vsf_equals_Vqf32 calls in hvx_vec_reduce_sum_n_qf32. Also curious if there are hidden bits involved in the HVX_Vector fp ops.

Yep. QF32 and QF16 have extra bits that are not visible to the SW.

Here is the branch where I fixed this issue and also went through and made everything consistently use reduce_sum_f32.
vec_dot_mxfp4_rx2 was giving the compiler a hard time with the reduce_sum_x2. The only way I could fix it is to keep the row_sum in f32. It ended up being better for all other cases so I updated all them. The the qf32 -> sf conversion can generally be done in the same instruction packet so it's essentially free.

https://github.com/qualcomm/llama.cpp/tree/hexagon-fa-and-reduce-sum

Tested on Gen3,4,5 and X-Elite.

I'm seeing a nice bump in perf across the board. Not huge but significant.
+5-10 T/S in some cases.
FA performance is steadily catching up to the CPU. Hopefully a few more iterations and we can enable it by default.

Please pull/merge/rebase, see how it does on your setup and I think we're good to merge.

# Conflicts: # ggml/src/ggml-hexagon/htp/hvx-reduce.h # ggml/src/ggml-hexagon/htp/matmul-ops.c

…ON_TOOLS_ROOT

chraac · 2026-01-31T05:00:47Z

ggml/src/ggml-hexagon/CMakeLists.txt

+    file(TO_CMAKE_PATH "${HEXAGON_TOOLS_ROOT}" HEXAGON_TOOLS_ROOT)
+    if (NOT IS_DIRECTORY "${HEXAGON_TOOLS_ROOT}")
+        message(FATAL_ERROR "Make sure HEXAGON_TOOLS_ROOT point to the correct Hexagon SDK installation.")
+    endif()


Thinking it may be good to derive HEXAGON_TOOLS_ROOT from hexagon_sdk.json in HEXAGON_SDK_ROOT.

…19141) * wip * ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation * ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations * wip * ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance * ggml-hexagon: refactor dot product functions to use a common loading function for improved readability * optimize vector dot product functions to use unified reduction for improved performance * wip * ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation * ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations * wip * ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance * ggml-hexagon: refactor dot product functions to use a common loading function for improved readability * optimize vector dot product functions to use unified reduction for improved performance * hexagon: optimize reduce-sum for v75+ * hexagon: always keep row_sums in sf/fp32 * ggml-hexagon: enhance directory checks for HEXAGON_SDK_ROOT and HEXAGON_TOOLS_ROOT * fix compiling error after rebase --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>

chraac added 7 commits January 25, 2026 21:18

wip

c2adc87

ggml-hexagon: add vectorized dot product function for FP32 and FP16 a…

092ffd6

…ccumulation

ggml-hexagon: optimize dot product functions for FP16 and FP32 with n…

1862e3c

…ew vectorized implementations

wip

6ad8f33

ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32…

e45844f

…x2 functions for improved performance

ggml-hexagon: refactor dot product functions to use a common loading …

71f5bdb

…function for improved readability

optimize vector dot product functions to use unified reduction for im…

2610805

…proved performance

chraac requested review from lhez and max-krasnyansky as code owners January 27, 2026 16:12

chraac marked this pull request as draft January 27, 2026 16:12

chraac commented Jan 27, 2026

View reviewed changes

github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Jan 27, 2026

loci-dev mentioned this pull request Jan 27, 2026

UPSTREAM PR #19141: [WIP]ggml-hexagon: flash-attn opt - part2 auroralabs-loci/llama.cpp#1052

Open

chraac and others added 8 commits January 29, 2026 13:58

wip

33291a7

ggml-hexagon: add vectorized dot product function for FP32 and FP16 a…

bc695ad

…ccumulation

ggml-hexagon: optimize dot product functions for FP16 and FP32 with n…

a97089b

…ew vectorized implementations

wip

10fc546

ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32…

8b39844

…x2 functions for improved performance

ggml-hexagon: refactor dot product functions to use a common loading …

6615e48

…function for improved readability

optimize vector dot product functions to use unified reduction for im…

5eac1fd

…proved performance

hexagon: optimize reduce-sum for v75+

1ec2cee

hexagon: always keep row_sums in sf/fp32

e4e8d5a

loci-dev mentioned this pull request Jan 31, 2026

UPSTREAM PR #19141: [WIP]ggml-hexagon: flash-attn opt - part2 auroralabs-loci/llama.cpp#1086

Open

chraac added 3 commits January 31, 2026 10:51

Merge branch 'hexagon-fa-and-reduce-sum' into dev-fa-opt-part2

33e3cf0

# Conflicts: # ggml/src/ggml-hexagon/htp/hvx-reduce.h # ggml/src/ggml-hexagon/htp/matmul-ops.c

ggml-hexagon: enhance directory checks for HEXAGON_SDK_ROOT and HEXAG…

30a0b35

…ON_TOOLS_ROOT

fix compiling error after rebase

6d7fed6

chraac commented Jan 31, 2026

View reviewed changes

chraac marked this pull request as ready for review January 31, 2026 05:04

chraac changed the title ~~[WIP]ggml-hexagon: flash-attn opt - part2~~ ggml-hexagon: flash-attn opt - part2 Jan 31, 2026

max-krasnyansky changed the title ~~ggml-hexagon: flash-attn opt - part2~~ ggml-hexagon: flash-attention and reduce-sum optimizations Jan 31, 2026

max-krasnyansky approved these changes Jan 31, 2026

View reviewed changes

max-krasnyansky merged commit 89f10ba into ggml-org:master Jan 31, 2026
74 of 75 checks passed

chraac deleted the dev-fa-opt-part2 branch January 31, 2026 05:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ggml-hexagon: flash-attention and reduce-sum optimizations#19141

ggml-hexagon: flash-attention and reduce-sum optimizations#19141
max-krasnyansky merged 19 commits intoggml-org:masterfrom
chraac:dev-fa-opt-part2

chraac commented Jan 27, 2026 •

edited

Loading

Uh oh!

chraac Jan 27, 2026

Uh oh!

max-krasnyansky commented Jan 30, 2026

Uh oh!

chraac commented Jan 30, 2026

Uh oh!

max-krasnyansky commented Jan 31, 2026 •

edited

Loading

Uh oh!

chraac Jan 31, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

chraac commented Jan 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Key changes

HVX Vector Math Optimizations

Performance

Uh oh!

chraac Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

max-krasnyansky commented Jan 30, 2026

Uh oh!

chraac commented Jan 30, 2026

Uh oh!

max-krasnyansky commented Jan 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chraac Jan 31, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

chraac commented Jan 27, 2026 •

edited

Loading

max-krasnyansky commented Jan 31, 2026 •

edited

Loading